Style Guide and Tabbed HTML Output
Reading about ggplot2:
The components of a graph are different layers, such as the data itself, the dots on a scatterplot, the line of best fit, the scales, and the coordinate system. We can adjust the axis scaling with scale_x_log10() for the x-axis, for example.
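As a minimal sketch of the layering idea, the example below builds a plot one component at a time. It uses the built-in mtcars data rather than the movie data, and the variable choices are assumptions made only for illustration.

```r
library(ggplot2)

# Built-in example data (an assumption; not the homework's dataset).
p_layers <- ggplot(mtcars, aes(x = disp, y = mpg)) +  # data and aesthetic mapping
  geom_point() +                                       # layer 1: the dots of the scatterplot
  geom_smooth(method = "lm", se = FALSE) +             # layer 2: the line of best fit
  scale_x_log10() +                                    # scale: log10 x-axis
  coord_cartesian()                                    # coordinate system (the default)

p_layers
```

Each `+` adds one component, so removing a line removes only that piece of the graph.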
The ggplot way is better, as we simply declare what we need to facet on and ggplot understands how to generate the graphics we need, instead of us needing to manually create each plot.
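To illustrate declaring a facet instead of building each panel by hand, here is a small sketch on the built-in mtcars data (an assumed stand-in for the reading's example).

```r
library(ggplot2)

# One declaration produces one panel per value of cyl.
p_facet <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  facet_wrap(~ cyl) +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon",
       title = "Fuel Economy by Weight, Faceted by Cylinders")

p_facet
```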
A geom_point, or points, is used to create a scatterplot.
The advanced parts are statistical transformations, coordinate systems, facets, and visual themes.
Themes are the overall visual defaults of a plot: background, grids, axes, default typeface, sizes, colors, etc.
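A short sketch of how a theme changes these defaults without touching the data or the geoms; the particular settings are arbitrary choices for illustration.

```r
library(ggplot2)

p_theme <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  theme_bw(base_size = 14) +                       # complete theme: white background, larger text
  theme(panel.grid.minor = element_blank(),        # drop the minor grid lines
        axis.title = element_text(face = "bold"))  # bold axis titles

p_theme
```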
The author says to use ggplot as it teaches us how to think about visualizing our data.
The critical principles are:
Critiquing Graphs:
The graph is from the following reddit link.
Looking at the thread this graph came from, this graph is meant to show simulated runs of the Martingale Roulette Strategy across 100 spins. The y-axis shows the amount of money and the x-axis shows the number of spins. Each run is a different colored line, and the main result is meant to show that the Martingale strategy with limited money and limited turns is not a good strategy for making money in real life.
This graph does not do a good job. There are no axis labels and no title to explain what the graph is about. Every run is drawn in its own bright, contrasting color, so the plot is overloaded and looks more like noise than structure. I would add axis labels and a title, and draw all runs in a single color with an alpha value so that regions where many runs overlap appear darker.
There is almost no text to tell the reader about the graph; only by reading the thread can someone understand it. The legend at the side also serves no purpose: an entry for every run of the simulation adds no information, and there are simply too many runs to keep track of the colors.
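The suggested redesign (one color, transparency, labels, no legend) can be sketched as follows. The thread's actual simulation settings are not known, so the helper `simulate_martingale()` and its bankroll, base bet, and win probability are hypothetical choices for illustration only.

```r
library(ggplot2)

set.seed(315)  # arbitrary seed so the sketch is reproducible

# Hypothetical helper: one Martingale run betting on red in American roulette.
# The bankroll, base bet, and win probability are illustrative assumptions.
simulate_martingale <- function(n_spins = 100, bankroll = 1000, base_bet = 10,
                                p_win = 18 / 38) {
  money <- numeric(n_spins)
  bet <- base_bet
  for (i in seq_len(n_spins)) {
    bet <- min(bet, bankroll)            # cannot bet more than we have
    if (bet > 0 && runif(1) < p_win) {
      bankroll <- bankroll + bet
      bet <- base_bet                    # reset the bet after a win
    } else {
      bankroll <- bankroll - bet
      bet <- 2 * bet                     # double the bet after a loss
    }
    money[i] <- bankroll
  }
  money
}

n_runs <- 50
runs <- do.call(rbind, lapply(seq_len(n_runs), function(r) {
  data.frame(run = r, spin = 1:100, money = simulate_martingale())
}))

# One color for every run; alpha makes overlapping paths appear darker.
p_martingale <- ggplot(runs, aes(x = spin, y = money, group = run)) +
  geom_line(color = "steelblue", alpha = 0.2) +
  labs(x = "Spin", y = "Bankroll (dollars)",
       title = "Simulated Martingale Roulette Runs")

p_martingale
```

Because color no longer encodes the run, no legend is drawn, which removes the uninformative key criticized above.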
(3 points each)
More on Pie Charts and Rose Diagrams
Pie charts and rose diagrams are rarely the most effective way of displaying categorical data.
library(tidyverse)
# reads Matey's IMDb rated movies TV Series, etc.
mateys_imdb <- read_csv("https://raw.githubusercontent.com/mateyneykov/315_code_data/master/data/mateys_imdb_ratings.csv")
# keeps only feature films
mateys_movies <- mateys_imdb %>% filter(`Title type` == "Feature Film")
mateys_movies <- mutate(mateys_movies,
vote_date = as.Date(mateys_movies$created,
format = "%a %b %d %H:%M:%S %Y"),
day_of_week = weekdays(vote_date),
weekend = ifelse(day_of_week %in% c("Saturday", "Sunday"),
"Weekend", "Workday"),
duration = cut(`Runtime (mins)`, c(0, 90, 120, Inf),
labels = c("Short", "Medium", "Long")),
ratings = cut(`You rated`, c(0, 4, 7, Inf),
labels = c("Low", "Med", "High")),
movie_period = cut(Year, c(0, 1980, 2000, 2018),
labels = c("Old", "Recent", "New"))
)
# one stacked bar of counts per movie period, bent into a pie by coord_polar
ggplot(mateys_movies, aes(x = "", fill = movie_period)) +
geom_bar(width = 1) +
coord_polar(theta = "y", start = 0) +
labs(x = "",
y = "",
fill = "Movie Period",
title = "Proportions of Movie Periods") +
theme_void()
library(gridExtra)
sbar <- ggplot(mateys_movies, aes(x = movie_period, fill = ratings)) +
geom_bar() +
labs(x = "Movie Period",
y = "Counts",
fill = "Ratings",
title = "Distributions of Ratings")
rose <- sbar + coord_polar()
grid.arrange(rose, sbar, ncol = 2)
The marginal distribution of movie_period can be seen from the radius of each petal.
The conditional distribution of ratings given that the movie period is Recent can be seen from how the radius of the Recent petal is partitioned into 3 colors.
(3 points each)
Data Manipulation and The Many Ways To Create 1-D and 2-D Bar Charts
library(tidyverse)
# reads Matey's IMDb rated movies TV Series, etc.
mateys_imdb <- read_csv("https://raw.githubusercontent.com/mateyneykov/315_code_data/master/data/mateys_imdb_ratings.csv")
# keeps only feature films
mateys_movies <- mateys_imdb %>% filter(`Title type` == "Feature Film")
mateys_movies <- mutate(mateys_movies,
vote_date = as.Date(mateys_movies$created,
format = "%a %b %d %H:%M:%S %Y"),
day_of_week = weekdays(vote_date),
weekend = ifelse(day_of_week %in% c("Saturday", "Sunday"),
"Weekend", "Workday"),
duration = cut(`Runtime (mins)`, c(0, 90, 120, Inf),
labels = c("Short", "Medium", "Long")),
ratings = cut(`You rated`, c(0, 4, 7, Inf),
labels = c("Low", "Med", "High")),
movie_period = cut(Year, c(0, 1980, 2000, 2018),
labels = c("Old", "Recent", "New")),
less_than_7_star = ifelse(`You rated` < 7,
"Less than 7 Stars",
"7 or More Stars"),
monday = ifelse(day_of_week == "Monday", "Yes", "No")
)
The additional columns are vote_date, day_of_week, weekend, duration, ratings, and movie_period.
The function used is mutate, from dplyr, whose author is Hadley Wickham.
ggplot(mateys_movies, aes(x = day_of_week)) +
geom_bar() +
labs(x = "Day of Week", y = "Count", title = "Counts of Votes Across Day of Week")
dow_marginal <- mateys_movies %>%
group_by(day_of_week) %>%
summarize(count = n())
ggplot(dow_marginal, aes(x = day_of_week, y = count)) +
geom_bar(stat = "identity") +
labs(x = "Day of Week", y = "Count", title = "Counts of Votes Across Day of Week")
ggplot(mateys_movies, aes(x = day_of_week, fill = less_than_7_star)) +
geom_bar() +
labs(x = "Day of Week",
y = "Count",
title = "Counts of Votes Across Day of Week",
fill = "Less than 7 Stars")
library(tidyverse)
days_stars <- mateys_movies %>%
group_by(day_of_week, less_than_7_star) %>%
summarize(count = n())
ggplot(days_stars, aes(x = day_of_week, y = count, fill = less_than_7_star)) +
geom_bar(stat = "identity") +
labs(x = "Day of Week",
y = "Count",
title = "Counts of Votes Across Day of Week",
fill = "Less than 7 Stars")
We use `stat = "identity"` in (e) to let ggplot know that we have already provided the counts, as the default is `"count"`.
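A small sketch of the same point on toy data (the data frame `toy` is made up for illustration): counting inside ggplot and supplying pre-computed counts give the same bar heights, and `geom_col()` is shorthand for `geom_bar(stat = "identity")`.

```r
library(ggplot2)

# Toy raw data: one row per vote (hypothetical values).
toy <- data.frame(day = c("Mon", "Mon", "Tue", "Mon", "Tue", "Wed"))

# Pre-computed counts, one row per day.
toy_counts <- as.data.frame(table(day = toy$day), responseName = "count")

p_raw      <- ggplot(toy, aes(x = day)) +
  geom_bar()                      # default stat = "count" tallies the rows
p_identity <- ggplot(toy_counts, aes(x = day, y = count)) +
  geom_bar(stat = "identity")     # heights taken directly from the y column
p_col      <- ggplot(toy_counts, aes(x = day, y = count)) +
  geom_col()                      # shorthand for stat = "identity"
```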
dow_star_labs <- labs(x = "Day of Week",
y = "Count",
title = "Counts of Votes Across Day of Week",
fill = "Less than 7 Stars")
g1 <- ggplot(mateys_movies, aes(x = day_of_week, fill = less_than_7_star)) +
geom_bar(position = "dodge") +
dow_star_labs
g2 <- ggplot(days_stars, aes(x = day_of_week, y = count, fill = less_than_7_star)) +
geom_bar(stat = "identity", position = "dodge") +
dow_star_labs
grid.arrange(g1, g2, ncol = 1)
dow_star_prop_labs <- labs(x = "Day of Week",
y = "Proportion",
title = "Proportions of Votes Across Day of Week",
fill = "Less than 7 Stars")
g1 <- ggplot(mateys_movies, aes(x = day_of_week, fill = less_than_7_star)) +
geom_bar(position = "fill") +
dow_star_prop_labs +
coord_flip()
g2 <- ggplot(days_stars, aes(x = day_of_week, y = count, fill = less_than_7_star)) +
geom_bar(stat = "identity", position = "fill") +
dow_star_prop_labs +
theme(axis.text.x = element_text(angle = 45))
grid.arrange(g1, g2, ncol = 1)
Rotated second graph in (g).
Flipped first graph in (g).
It is easier to see the marginal distribution over days of week with the stacked bar chart, conditional distributions given days of week with the side-by-side bar chart, and relative proportions in the proportional bar charts. Each chart has its own advantages and should be used depending on what kind of distribution we want to focus on.
ggplot(mateys_movies, aes(x = day_of_week)) +
geom_bar(aes(y = after_stat(count / sum(count)))) + # after_stat() replaces the deprecated ..count.. notation
labs(x = "Day of Week",
y = "Proportion",
title = "Proportions of Votes Across Day of Week")
ggplot(mateys_movies, aes(x = day_of_week)) +
geom_bar(aes(y = after_stat(100 * count / sum(count)))) + # after_stat() replaces the deprecated ..count.. notation
labs(x = "Day of Week",
y = "Percentage",
title = "Percentage of Votes Across Day of Week")
(3 points each)
Reordering Categories and Bars
The default ordering of character variables is alphabetical.
Rename with fct_recode.
Change order by first appearance with fct_inorder.
Change order by frequency with fct_infreq.
Change order by reverse frequency with fct_infreq then fct_rev.
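The effect of each function on the level order can be checked on a toy vector (the values in `days` are made up for illustration):

```r
library(forcats)

days <- c("Tue", "Mon", "Tue", "Wed", "Tue", "Mon")

lv_default <- levels(factor(days))                      # "Mon" "Tue" "Wed" (alphabetical)
lv_inorder <- levels(fct_inorder(days))                 # "Tue" "Mon" "Wed" (first appearance)
lv_infreq  <- levels(fct_infreq(days))                  # "Tue" "Mon" "Wed" (most frequent first)
lv_rev     <- levels(fct_rev(fct_infreq(days)))         # "Wed" "Mon" "Tue" (least frequent first)
lv_recode  <- levels(fct_recode(days, Tuesday = "Tue")) # "Mon" "Tuesday" "Wed" (new = "old")
```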
library(forcats)
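# order bars by total count per day, largest first (the default .fun = median would use the median of the two group counts)
ggplot(days_stars, aes(x = fct_reorder(day_of_week, count, .fun = sum, .desc = TRUE), y = count, fill = less_than_7_star)) +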
geom_bar(stat = "identity") +
labs(x = "Day of Week",
y = "Count",
title = "Counts of Votes Across Day of Week",
fill = "Less than 7 Stars")
g1 <- ggplot(mateys_movies, aes(x = fct_rev(fct_infreq(day_of_week)), fill = less_than_7_star)) +
geom_bar(position = "dodge") +
dow_star_labs
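# .fun = sum orders by total count per day, matching the increasing-frequency order of g1
g2 <- ggplot(days_stars, aes(x = fct_reorder(day_of_week, count, .fun = sum), y = count, fill = less_than_7_star)) +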
geom_bar(stat = "identity", position = "dodge") +
dow_star_labs
grid.arrange(g1, g2, ncol = 1)
# calendar order of the days, used by fct_relevel below
dow_order <- c("Sunday", "Monday", "Tuesday", "Wednesday",
"Thursday", "Friday", "Saturday")
g1 <- ggplot(mateys_movies, aes(x = fct_relevel(day_of_week, dow_order),
fill = less_than_7_star)) +
geom_bar(position = "fill") +
dow_star_prop_labs
g2 <- ggplot(days_stars, aes(x = fct_relevel(day_of_week, dow_order),
y = count, fill = less_than_7_star)) +
geom_bar(stat = "identity", position = "fill") +
dow_star_prop_labs +
theme(axis.text.x = element_text(angle = 45))
grid.arrange(g1, g2, ncol = 1)
dow_abbr <- c(Su = "Sunday", M = "Monday", Tu = "Tuesday", W = "Wednesday",
Th = "Thursday", F = "Friday", Sa = "Saturday")
g1 <- ggplot(mateys_movies, aes(x = fct_recode(fct_relevel(day_of_week, dow_order), !!!dow_abbr),
fill = less_than_7_star)) +
geom_bar(position = "fill") +
dow_star_prop_labs
g2 <- ggplot(days_stars, aes(x = fct_recode(fct_relevel(day_of_week, dow_order), !!!dow_abbr),
y = count, fill = less_than_7_star)) +
geom_bar(stat = "identity", position = "fill") +
dow_star_prop_labs +
theme(axis.text.x = element_text(angle = 45))
grid.arrange(g1, g2, ncol = 1)
(3 points each)
Incorporating Statistical Information Into Graphs
# Add the following information to the matey_counts dataset:
# Proportions and percentages corresponding to each category
# The standard error on the proportions or percentages corresponding to each
# category
# Lower bound of an (approximate) 95% confidence interval around the true
# proportion in each category
# Upper bound of an (approximate) 95% confidence interval around the true
# proportion in each category
# Manipulate the day_of_week
days_movie_counts <- mateys_movies %>% # Start with the mateys_movies data.frame
group_by(day_of_week) %>% # group by the days_of_week variable
summarise(count = n()) %>% # summarize the dataset by calculating the count of each day of the week
mutate(total = sum(count), # add total number of observations
proportion = count / total, # add proportions
percentage = proportion * 100, # add percentages
std_error = sqrt(proportion * (1 - proportion) / total), # add standard error of each proportion
lower = proportion - 1.96 * std_error, # compute lower bound
upper = proportion + 1.96 * std_error) # compute upper bound
days_movie_counts
## # A tibble: 7 x 8
## day_of_week count total proportion percentage std_error lower upper
## <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Friday 59 737 0.0801 8.01 0.01000 0.0605 0.0996
## 2 Monday 164 737 0.223 22.3 0.0153 0.192 0.253
## 3 Saturday 125 737 0.170 17.0 0.0138 0.143 0.197
## 4 Sunday 174 737 0.236 23.6 0.0156 0.205 0.267
## 5 Thursday 80 737 0.109 10.9 0.0115 0.0861 0.131
## 6 Tuesday 70 737 0.0950 9.50 0.0108 0.0738 0.116
## 7 Wednesday 65 737 0.0882 8.82 0.0104 0.0677 0.109
Fixed standard error.
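As a check on the formula, the standard error of a sample proportion is sqrt(p * (1 - p) / n), and the approximate 95% interval is p plus or minus 1.96 standard errors. The sketch below recomputes the Friday row by hand from the counts in the table above (59 of 737).

```r
n_total  <- 737   # total number of rated feature films (from the table)
n_friday <- 59    # votes cast on a Friday (from the table)

p_hat <- n_friday / n_total                   # sample proportion
se    <- sqrt(p_hat * (1 - p_hat) / n_total)  # standard error of the proportion
ci    <- p_hat + c(-1, 1) * 1.96 * se         # approximate 95% confidence interval

round(c(proportion = p_hat, std_error = se, lower = ci[1], upper = ci[2]), 4)
```

The rounded values agree with the Friday row of the tibble.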
The difference in the two graphs is in the y-axis scale.
g1 <- ggplot(days_movie_counts, aes(x = day_of_week, y = percentage)) +
geom_bar(stat = "identity") +
labs(x = "Day of Week", y = "Percentage", title = "Percentages of Movies by Day of Week") +
theme(axis.text.x = element_text(angle = 45))
g2 <- ggplot(days_movie_counts, aes(x = day_of_week, y = proportion)) +
geom_bar(stat = "identity") +
labs(x = "Day of Week", y = "Proportion", title = "Proportions of Movies by Day of Week") +
theme(axis.text.x = element_text(angle = 45))
grid.arrange(g1, g2, ncol = 2)
ggplot(days_movie_counts, aes(x = day_of_week, y = proportion)) +
geom_bar(stat = "identity") +
geom_errorbar(aes(ymin = lower, ymax = upper)) +
labs(x = "Day of Week", y = "Proportion", title = "Proportions of Movies by Day of Week") +
theme(axis.text.x = element_text(angle = 45))
(3 points each)
Adjusting Legends
The following code will produce a graph of the marginal distribution of the movie periods of the movies Matey rated, and the conditional distributions of Matey’s ratings given movie_period.
ggplot(mateys_movies, aes(x = movie_period, fill = ratings)) +
geom_bar() +
labs(fill = "Matey's Rating")
dow_abbr <- c(Su = "Sunday", M = "Monday", Tu = "Tuesday", W = "Wednesday",
Th = "Thursday", F = "Friday", Sa = "Saturday")
g1 <- ggplot(mateys_movies, aes(x = fct_recode(fct_relevel(day_of_week, dow_order), !!!dow_abbr),
fill = less_than_7_star)) +
geom_bar(position = "fill") +
dow_star_prop_labs
g2 <- ggplot(days_stars, aes(x = fct_recode(fct_relevel(day_of_week, dow_order), !!!dow_abbr),
y = count, fill = less_than_7_star)) +
geom_bar(stat = "identity", position = "fill") +
dow_star_prop_labs +
theme(axis.text.x = element_text(angle = 45))
grid.arrange(g1, g2, ncol = 1)
g1 <- g1 + theme(legend.position = "bottom")
g2 <- g2 + theme(legend.position = "bottom")
grid.arrange(g1, g2, ncol = 1)
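Other common legend adjustments follow the same pattern. The sketch below uses the built-in mtcars data (an assumption, chosen so the block runs on its own) to set the legend title, reverse the key order, and move the legend.

```r
library(ggplot2)

p_legend <- ggplot(mtcars, aes(x = factor(cyl), fill = factor(gear))) +
  geom_bar() +
  labs(x = "Cylinders", y = "Count", fill = "Gears") +  # legend title set through labs()
  guides(fill = guide_legend(reverse = TRUE)) +          # reverse the order of the legend keys
  theme(legend.position = "top")                         # also "bottom", "left", "right", "none"

p_legend
```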